QC metrics has been used to justify the pipelien quality. Here we propose an analysis to evaluate the pipeline updates/changes by using Picard QC metrics.
The key input data for this test are two Picardtools QC metrics from two pipelines
First load functions we will use in this test
Here is the list of input parameters
source('/Users/jishuxu/Works/github/HCA/skylab/benchmarking/smartseq2/R/analysis_functions.R')
## inputs, python style
option_list <- list(
make_option(
"--bmetrics",
type = "character",
default = NULL,
help = " First/Base metrics file name",
metavar = "character"
),
make_option(
"--umetrics",
type = "character",
default = NULL,
help = " Second/Updated metrics file name",
metavar = "character"
),
make_option(
"--output_name",
type = "character",
default = "out",
help = "output file name [default= %default]",
metavar = "character"
),
make_option(
"--metKeys",
type = "character",
default = "out",
help = "A list of metrics to analyze",
metavar = "character"
)
)
# arg rendered by rmkdown
args<-strsplit(commandArgs(trailingOnly = TRUE),split=' ')
opt_parser <- OptionParser(option_list = option_list)
opt <- parse_args(opt_parser,args=args[[1]])
# load params
output_name <- opt$output_nameThe QC metrics files are the collection of Picardtools QC outputs, which are formated as N*M matrix The rows are QC metrics and columns are cells/samples.
Here is the list of metrics we will analysis with
## [1] "detected_ratio" "MT_ratio"
## [3] "PCT_PF_READS_ALIGNED" "PCT_PF_READS_IMPROPER_PAIRS"
## [5] "PCT_READS_ALIGNED_IN_PAIRS" "PCT_CODING_BASES"
## [7] "PCT_INTERGENIC_BASES" "PCT_INTRONIC_BASES"
## [9] "PCT_UTR_BASES" "PCT_USABLE_BASES"
## [11] "PCT_MRNA_BASES" "PCT_RIBOSOMAL_BASES"
## [13] "PERCENT_DUPLICATION" "MEDIAN_5PRIME_TO_3PRIME_BIAS"
## [15] "MEDIAN_3PRIME_BIAS" "MEDIAN_5PRIME_BIAS"
## [17] "MEDIAN_CV_COVERAGE" "PF_MISMATCH_RATE"
## [19] "MEDIAN_INSERT_SIZE"
First we match column names(cell IDs) between two pipelines metrics
colnames1 <- colnames(met1)
colnames2 <- colnames(met2)
mlist <- match(colnames1, colnames2)
nalist <- is.na(mlist)
if (sum(nalist) > 0) {
print("input files have different column elements")
exit()
}
met2 <- met2[, mlist]Then select the subset of QC metrics to run analysis.
Finally, we have two QC metrics to run analysis on.
To evaluate the pipeline changes/updates, we will carry out statistical tests on each QC metric and these tests are described and visualized as following:
Let’s run the analysis mentioned above on the set of QC metrics.
out <- c()
pouts <- list()
for (ii in 1:length(metKeys)) {
x <- as.numeric(met1.core[metKeys[ii], ])
y <- as.numeric(met2.core[metKeys[ii], ])
z <- data.frame('Base' = x, 'Updated' = y)
# linear regression model
p1 <- plotlm(z)
# plot histogram
p2 <- plothist(z)
# plot box
p3 <- plotBox(z)
# plot KS
p4 <- plotKS(z)
# test block
text <-
paste(
"A: Density plot of",
metKeys[ii],
" from two pipeline.",
"B: Violin of ",
metKeys[ii],
", labeled with Wilcoxon test result.",
"C: Linear regression test of ",
metKeys[ii],
"between two pipelines.",
"D: K-S test, D statistics is labeled",
sep = " "
)
cat(metKeys[ii],text,' \n')
# arranage plots in one figure
gp <-
ggarrange(
p2$p,
p3$p,
p1$p,
p4$p,
labels = c("A", "B", "C", "D")
) # arrange 4 plots
gp <-
gp + ggtitle(paste(metKeys[ii])) + theme(plot.title = element_text(
hjust = 0.5,
size = 15,
face = 'bold'
))
print(gp)
out <-
rbind(out,
c(
metKeys[ii],
p1$beta,
p1$a,
p1$r2,
p1$fpval,
p3$pval,
p4$D,
p4$pval
)) # test, statistics outputs
}## detected_ratio A: Density plot of detected_ratio from two pipeline. B: Violin of detected_ratio , labeled with Wilcoxon test result. C: Linear regression test of detected_ratio between two pipelines. D: K-S test, D statistics is labeled
## MT_ratio A: Density plot of MT_ratio from two pipeline. B: Violin of MT_ratio , labeled with Wilcoxon test result. C: Linear regression test of MT_ratio between two pipelines. D: K-S test, D statistics is labeled
## PCT_PF_READS_ALIGNED A: Density plot of PCT_PF_READS_ALIGNED from two pipeline. B: Violin of PCT_PF_READS_ALIGNED , labeled with Wilcoxon test result. C: Linear regression test of PCT_PF_READS_ALIGNED between two pipelines. D: K-S test, D statistics is labeled
## PCT_PF_READS_IMPROPER_PAIRS A: Density plot of PCT_PF_READS_IMPROPER_PAIRS from two pipeline. B: Violin of PCT_PF_READS_IMPROPER_PAIRS , labeled with Wilcoxon test result. C: Linear regression test of PCT_PF_READS_IMPROPER_PAIRS between two pipelines. D: K-S test, D statistics is labeled
## PCT_READS_ALIGNED_IN_PAIRS A: Density plot of PCT_READS_ALIGNED_IN_PAIRS from two pipeline. B: Violin of PCT_READS_ALIGNED_IN_PAIRS , labeled with Wilcoxon test result. C: Linear regression test of PCT_READS_ALIGNED_IN_PAIRS between two pipelines. D: K-S test, D statistics is labeled
## PCT_CODING_BASES A: Density plot of PCT_CODING_BASES from two pipeline. B: Violin of PCT_CODING_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_CODING_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_INTERGENIC_BASES A: Density plot of PCT_INTERGENIC_BASES from two pipeline. B: Violin of PCT_INTERGENIC_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_INTERGENIC_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_INTRONIC_BASES A: Density plot of PCT_INTRONIC_BASES from two pipeline. B: Violin of PCT_INTRONIC_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_INTRONIC_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_UTR_BASES A: Density plot of PCT_UTR_BASES from two pipeline. B: Violin of PCT_UTR_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_UTR_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_USABLE_BASES A: Density plot of PCT_USABLE_BASES from two pipeline. B: Violin of PCT_USABLE_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_USABLE_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_MRNA_BASES A: Density plot of PCT_MRNA_BASES from two pipeline. B: Violin of PCT_MRNA_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_MRNA_BASES between two pipelines. D: K-S test, D statistics is labeled
## PCT_RIBOSOMAL_BASES A: Density plot of PCT_RIBOSOMAL_BASES from two pipeline. B: Violin of PCT_RIBOSOMAL_BASES , labeled with Wilcoxon test result. C: Linear regression test of PCT_RIBOSOMAL_BASES between two pipelines. D: K-S test, D statistics is labeled
## PERCENT_DUPLICATION A: Density plot of PERCENT_DUPLICATION from two pipeline. B: Violin of PERCENT_DUPLICATION , labeled with Wilcoxon test result. C: Linear regression test of PERCENT_DUPLICATION between two pipelines. D: K-S test, D statistics is labeled
## MEDIAN_5PRIME_TO_3PRIME_BIAS A: Density plot of MEDIAN_5PRIME_TO_3PRIME_BIAS from two pipeline. B: Violin of MEDIAN_5PRIME_TO_3PRIME_BIAS , labeled with Wilcoxon test result. C: Linear regression test of MEDIAN_5PRIME_TO_3PRIME_BIAS between two pipelines. D: K-S test, D statistics is labeled
## MEDIAN_3PRIME_BIAS A: Density plot of MEDIAN_3PRIME_BIAS from two pipeline. B: Violin of MEDIAN_3PRIME_BIAS , labeled with Wilcoxon test result. C: Linear regression test of MEDIAN_3PRIME_BIAS between two pipelines. D: K-S test, D statistics is labeled
## MEDIAN_5PRIME_BIAS A: Density plot of MEDIAN_5PRIME_BIAS from two pipeline. B: Violin of MEDIAN_5PRIME_BIAS , labeled with Wilcoxon test result. C: Linear regression test of MEDIAN_5PRIME_BIAS between two pipelines. D: K-S test, D statistics is labeled
## MEDIAN_CV_COVERAGE A: Density plot of MEDIAN_CV_COVERAGE from two pipeline. B: Violin of MEDIAN_CV_COVERAGE , labeled with Wilcoxon test result. C: Linear regression test of MEDIAN_CV_COVERAGE between two pipelines. D: K-S test, D statistics is labeled
## PF_MISMATCH_RATE A: Density plot of PF_MISMATCH_RATE from two pipeline. B: Violin of PF_MISMATCH_RATE , labeled with Wilcoxon test result. C: Linear regression test of PF_MISMATCH_RATE between two pipelines. D: K-S test, D statistics is labeled
## MEDIAN_INSERT_SIZE A: Density plot of MEDIAN_INSERT_SIZE from two pipeline. B: Violin of MEDIAN_INSERT_SIZE , labeled with Wilcoxon test result. C: Linear regression test of MEDIAN_INSERT_SIZE between two pipelines. D: K-S test, D statistics is labeled